K-Nearest Neighbors (KNN) — Classification & Regression (From Scratch)#

KNN is one of the most intuitive ML algorithms:

  • store the training data

  • for a new point, look up the K closest points

  • vote (classification) or average (regression)

It’s often described as lazy learning: there’s almost no “training” — the work happens at prediction time.

What you’ll learn#

  • the core idea (with geometry + intuition)

  • the math for classification and regression

  • why feature scaling matters

  • how to implement KNN from scratch in NumPy

  • how to use sklearn KNN models and understand their key parameters

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(7)

1) Intuition: “ask your closest neighbors”#

Imagine you move into a new city and want to predict the rent for an apartment.

You might ask:

“What do similar apartments nearby cost?”

That’s KNN.

  • Similarity is measured by a distance (e.g., Euclidean distance).

  • The “answer” is based on the K nearest examples.

KNN is like a well-organized memory:

  • no complicated training

  • handles complex, non-linear patterns well (it assumes no particular functional form)

  • but can be slow at prediction time if the dataset is huge
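The rent analogy fits in a few lines of NumPy. This is an illustrative sketch with made-up sizes and rents (1 feature, k=3), just to show the "look up similar examples and average" idea before the full implementation below:

```python
import numpy as np

# Hypothetical data: apartment size (m^2) -> monthly rent
sizes = np.array([30.0, 45.0, 50.0, 80.0, 100.0])
rents = np.array([700.0, 950.0, 1000.0, 1500.0, 1900.0])

query = 48.0  # a new apartment we want a rent estimate for
k = 3

# indices of the k most similar sizes (smallest absolute difference)
nearest = np.argsort(np.abs(sizes - query))[:k]
estimate = rents[nearest].mean()  # average rent of the 3 closest apartments
print(estimate)
```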

2) The algorithm (math + notation)#

Let:

  • \(X \in \mathbb{R}^{n \times d}\) be the training features

  • \(y\) be the targets

  • \(x \in \mathbb{R}^d\) be a query point

2.1 Distances#

We compute a distance \(D(x, x_i)\) between the query point and every training point.

A common family is the Minkowski distance:

\[ D_p(x, x_i) = \left(\sum_{j=1}^d |x_j - x_{i,j}|^p \right)^{1/p} \]
  • \(p=2\) → Euclidean distance

  • \(p=1\) → Manhattan distance
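A quick sanity check of the two special cases on a 3-4-5 right triangle (two arbitrary points chosen for the demo):

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# p=2 (Euclidean): sqrt(3^2 + 4^2) = 5.0
d_euclidean = np.sum(np.abs(a - b) ** 2) ** 0.5
# p=1 (Manhattan): |3| + |4| = 7.0
d_manhattan = np.sum(np.abs(a - b))

print(d_euclidean, d_manhattan)  # 5.0 7.0
```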

2.2 Classification#

Let \(\mathcal{N}_k(x)\) be the set of indices of the \(k\) nearest neighbors.

Uniform vote:

\[ \hat{y}(x) = \arg\max_c \sum_{i \in \mathcal{N}_k(x)} \mathbf{1}[y_i = c] \]

Distance-weighted vote (common):

\[ w_i = \frac{1}{D(x, x_i) + \varepsilon} \quad\Rightarrow\quad \hat{y}(x) = \arg\max_c \sum_{i \in \mathcal{N}_k(x)} w_i \; \mathbf{1}[y_i = c] \]
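A tiny numeric example of the two vote rules, with hypothetical neighbor labels and distances for one query point. Note how distance weighting can flip the outcome: one very close neighbor outvotes two farther ones.

```python
import numpy as np

labels = np.array([0, 1, 1])       # neighbor classes (hypothetical)
dists = np.array([0.1, 0.5, 0.6])  # distances to the query (hypothetical)
eps = 1e-9

# Uniform vote: class 1 wins 2-to-1
uniform_winner = np.bincount(labels).argmax()

# Distance-weighted vote: w ~ [10, 2, 1.67], so class 0 wins 10 to 3.67
w = 1.0 / (dists + eps)
scores = np.array([w[labels == c].sum() for c in [0, 1]])
weighted_winner = scores.argmax()

print(uniform_winner, weighted_winner)  # 1 0
```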

2.3 Regression#

Uniform average:

\[ \hat{y}(x) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i \]

Distance-weighted average:

\[ \hat{y}(x) = \frac{\sum_{i \in \mathcal{N}_k(x)} w_i y_i}{\sum_{i \in \mathcal{N}_k(x)} w_i} \]
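The same contrast for regression, again with made-up neighbor targets and distances. The weighted estimate is pulled strongly toward the closest neighbor's value:

```python
import numpy as np

y_neighbors = np.array([1.0, 3.0, 5.0])  # neighbor targets (hypothetical)
dists = np.array([0.1, 1.0, 2.0])        # distances to the query (hypothetical)
eps = 1e-9

uniform = y_neighbors.mean()  # plain average: 3.0

w = 1.0 / (dists + eps)       # w ~ [10, 1, 0.5]
weighted = np.sum(w * y_neighbors) / np.sum(w)  # ~1.35, close to the nearest target

print(uniform, round(weighted, 3))
```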

2.4 Big practical gotcha: scaling#

If one feature is measured in kilometers and another in millimeters, the kilometer feature dominates the distance.

So KNN almost always needs feature scaling (e.g. standardization).
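A two-line illustration of the problem, with made-up units (rooms vs meters). The raw Euclidean distance is driven almost entirely by the large-scale feature:

```python
import numpy as np

# Illustrative apartments: (rooms, distance to city center in meters)
a = np.array([2.0, 500.0])
b = np.array([5.0, 510.0])  # very different rooms, nearly the same location
c = np.array([2.0, 900.0])  # identical rooms, much farther away

# Raw Euclidean distance: the meters feature dominates completely
print(np.linalg.norm(a - b))  # ~10.4 -> b looks "close" despite 3 extra rooms
print(np.linalg.norm(a - c))  # 400.0 -> c looks "far" despite identical rooms
```

After standardizing each feature (as StandardScaler does in section 7), both features contribute on comparable scales.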

3) Geometry: L1 vs L2 distance “balls”#

Distance isn’t just a formula — it’s a shape.

  • L2 distance (Euclidean) → circles (2D)

  • L1 distance (Manhattan) → diamonds (2D)

This changes which points count as “neighbors”.

# Visualize iso-distance curves around the origin
r = 1.0

# L2 circle
theta = np.linspace(0, 2*np.pi, 400)
x_circle = r * np.cos(theta)
y_circle = r * np.sin(theta)

# L1 diamond: |x| + |y| = r
x_diamond = np.array([0, r, 0, -r, 0])
y_diamond = np.array([r, 0, -r, 0, r])

fig = go.Figure()
fig.add_trace(go.Scatter(x=x_circle, y=y_circle, mode="lines", name="L2: x²+y²=1"))
fig.add_trace(go.Scatter(x=x_diamond, y=y_diamond, mode="lines", name="L1: |x|+|y|=1"))
fig.update_layout(
    title="Iso-distance curves in 2D",
    xaxis_title="x",
    yaxis_title="y",
    width=700,
    height=450,
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

4) A toy classification dataset (2D)#

We’ll use make_moons because it’s not linearly separable — perfect to show why KNN can be strong.

X, y = make_moons(n_samples=500, noise=0.25, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)

fig = px.scatter(
    x=X_tr[:, 0],
    y=X_tr[:, 1],
    color=y_tr.astype(str),
    title="Training set (make_moons)",
    labels={"x": "x1", "y": "x2", "color": "class"},
)
fig.update_traces(marker=dict(size=7, opacity=0.8))
fig.update_layout(width=700, height=450)
fig.show()

5) KNN from scratch (NumPy)#

We’ll implement:

  • pairwise distances

  • “k nearest” lookup

  • classifier (uniform + distance-weighted)

  • regressor (uniform + distance-weighted)

We’ll keep it readable and educational.

def pairwise_minkowski_distances(X_query: np.ndarray, X_train: np.ndarray, p: float = 2.0) -> np.ndarray:
    X_query = np.asarray(X_query, dtype=float)
    X_train = np.asarray(X_train, dtype=float)

    if X_query.ndim != 2 or X_train.ndim != 2:
        raise ValueError("X_query and X_train must be 2D arrays")
    if X_query.shape[1] != X_train.shape[1]:
        raise ValueError("X_query and X_train must have the same number of features")
    if p <= 0:
        raise ValueError("p must be > 0")

    if p == 2:
        # (a-b)^2 = a^2 + b^2 - 2ab trick
        q_sq = np.sum(X_query**2, axis=1, keepdims=True)  # (m,1)
        t_sq = np.sum(X_train**2, axis=1)  # (n,)
        d2 = q_sq + t_sq[None, :] - 2.0 * (X_query @ X_train.T)
        d2 = np.maximum(d2, 0.0)
        return np.sqrt(d2)

    return np.sum(np.abs(X_query[:, None, :] - X_train[None, :, :]) ** p, axis=2) ** (1.0 / p)


def kneighbors_indices_and_distances(
    X_train: np.ndarray,
    X_query: np.ndarray,
    k: int,
    p: float = 2.0,
):
    distances = pairwise_minkowski_distances(X_query, X_train, p=p)  # (m,n)

    if k <= 0:
        raise ValueError("k must be >= 1")
    if k > distances.shape[1]:
        raise ValueError("k cannot exceed number of training samples")

    idx = np.argpartition(distances, kth=k - 1, axis=1)[:, :k]  # (m,k), unsorted
    rows = np.arange(distances.shape[0])[:, None]
    d_k = distances[rows, idx]

    order = np.argsort(d_k, axis=1)
    idx_sorted = idx[rows, order]
    d_sorted = d_k[rows, order]
    return idx_sorted, d_sorted


class ScratchKNNClassifier:
    def __init__(self, n_neighbors: int = 5, p: float = 2.0, weights: str = "uniform", eps: float = 1e-9):
        self.n_neighbors = int(n_neighbors)
        self.p = float(p)
        self.weights = weights
        self.eps = float(eps)

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        self.classes_, self.y_encoded_ = np.unique(self.y_, return_inverse=True)
        return self

    def _weights(self, distances: np.ndarray) -> np.ndarray:
        if self.weights == "uniform":
            return np.ones_like(distances, dtype=float)
        if self.weights == "distance":
            return 1.0 / (distances + self.eps)
        raise ValueError("weights must be 'uniform' or 'distance'")

    def predict_proba(self, X_query: np.ndarray) -> np.ndarray:
        idx, d = kneighbors_indices_and_distances(self.X_, np.asarray(X_query, dtype=float), self.n_neighbors, p=self.p)
        w = self._weights(d)
        neighbor_labels = self.y_encoded_[idx]

        n_query = idx.shape[0]
        n_classes = self.classes_.shape[0]
        proba = np.zeros((n_query, n_classes), dtype=float)

        w_sum = np.sum(w, axis=1, keepdims=True)
        for c in range(n_classes):
            proba[:, c] = np.sum(w * (neighbor_labels == c), axis=1) / w_sum[:, 0]
        return proba

    def predict(self, X_query: np.ndarray) -> np.ndarray:
        proba = self.predict_proba(X_query)
        return self.classes_[np.argmax(proba, axis=1)]


class ScratchKNNRegressor:
    def __init__(self, n_neighbors: int = 5, p: float = 2.0, weights: str = "uniform", eps: float = 1e-9):
        self.n_neighbors = int(n_neighbors)
        self.p = float(p)
        self.weights = weights
        self.eps = float(eps)

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y, dtype=float)
        return self

    def _weights(self, distances: np.ndarray) -> np.ndarray:
        if self.weights == "uniform":
            return np.ones_like(distances, dtype=float)
        if self.weights == "distance":
            return 1.0 / (distances + self.eps)
        raise ValueError("weights must be 'uniform' or 'distance'")

    def predict(self, X_query: np.ndarray) -> np.ndarray:
        idx, d = kneighbors_indices_and_distances(self.X_, np.asarray(X_query, dtype=float), self.n_neighbors, p=self.p)
        w = self._weights(d)
        y_neighbors = self.y_[idx]
        return np.sum(w * y_neighbors, axis=1) / np.sum(w, axis=1)

5.1 One query point in slow motion#

Let’s pick a single test point and literally see which points become its neighbors.

scratch_knn = ScratchKNNClassifier(n_neighbors=10, p=2, weights="distance").fit(X_tr, y_tr)

xq = X_te[0:1]
yq = y_te[0]

idx, d = kneighbors_indices_and_distances(X_tr, xq, k=10, p=2)
idx = idx[0]
d = d[0]

print("Query true class:", yq)
print("Nearest neighbor distances:", np.round(d, 3))
print("Nearest neighbor labels:", y_tr[idx])

# Plot: training points, query point, and highlight neighbors
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_tr[:, 0], y=X_tr[:, 1], mode="markers",
                         marker=dict(color=y_tr, colorscale="Viridis", size=7, opacity=0.6),
                         name="train"))

fig.add_trace(go.Scatter(x=xq[:, 0], y=xq[:, 1], mode="markers",
                         marker=dict(symbol="star", size=16, color="red"),
                         name="query"))

neighbors = X_tr[idx]
fig.add_trace(go.Scatter(x=neighbors[:, 0], y=neighbors[:, 1], mode="markers",
                         marker=dict(size=12, color="black", symbol="circle-open"),
                         name="neighbors"))

# draw thin lines from query to neighbors
for pt in neighbors:
    fig.add_trace(go.Scatter(
        x=[xq[0, 0], pt[0]],
        y=[xq[0, 1], pt[1]],
        mode="lines",
        line=dict(color="rgba(0,0,0,0.15)", width=1),
        showlegend=False,
    ))

fig.update_layout(title="A query point and its 10 nearest neighbors", width=750, height=500)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
Query true class: 1
Nearest neighbor distances: [0.028 0.121 0.128 0.173 0.18  0.201 0.205 0.208 0.211 0.22 ]
Nearest neighbor labels: [1 1 1 1 1 1 1 1 1 1]

5.2 Decision boundary (from scratch)#

To visualize KNN, we’ll compute predictions on a grid and color the plane.

def plot_knn_boundary_2d(model, X_train, y_train, title: str, grid_steps: int = 220):
    x_min, x_max = X_train[:, 0].min() - 0.8, X_train[:, 0].max() + 0.8
    y_min, y_max = X_train[:, 1].min() - 0.8, X_train[:, 1].max() + 0.8

    xs = np.linspace(x_min, x_max, grid_steps)
    ys = np.linspace(y_min, y_max, grid_steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.c_[xx.ravel(), yy.ravel()]

    if hasattr(model, "predict_proba"):
        proba = model.predict_proba(grid)
        z = proba[:, 1].reshape(xx.shape)
        z_title = "P(class=1)"
    else:
        z = model.predict(grid).reshape(xx.shape)
        z_title = "prediction"

    fig = go.Figure()
    fig.add_trace(go.Contour(
        x=xs,
        y=ys,
        z=z,
        colorscale="RdBu",
        opacity=0.75,
        contours=dict(showlines=False),
        colorbar=dict(title=z_title),
    ))

    fig.add_trace(go.Scatter(
        x=X_train[:, 0],
        y=X_train[:, 1],
        mode="markers",
        marker=dict(color=y_train, colorscale="Viridis", size=6, line=dict(width=0.5, color="white")),
        name="train",
    ))

    fig.update_layout(title=title, width=750, height=520)
    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    return fig


for k in [1, 5, 25]:
    m = ScratchKNNClassifier(n_neighbors=k, p=2, weights="uniform").fit(X_tr, y_tr)
    fig = plot_knn_boundary_2d(m, X_tr, y_tr, title=f"Scratch KNN boundary (k={k}, uniform)")
    fig.show()

5.3 Bias–variance intuition: small k vs large k#

K is like a zoom level:

  • k=1: super local, very wiggly boundary → low bias, high variance

  • large k: very smooth boundary → higher bias, lower variance

A useful mental model:

“How many neighbors should I ask for advice?”

If you ask one person, you might get an extreme opinion. If you ask a hundred, you get a stable average — but you might miss niche local context.

def accuracy_for_k(k: int, weights: str = "uniform") -> float:
    model = ScratchKNNClassifier(n_neighbors=k, p=2, weights=weights).fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return float(accuracy_score(y_te, y_pred))

ks = list(range(1, 51, 2))
acc_uniform = [accuracy_for_k(k, weights="uniform") for k in ks]
acc_distance = [accuracy_for_k(k, weights="distance") for k in ks]

fig = go.Figure()
fig.add_trace(go.Scatter(x=ks, y=acc_uniform, mode="lines+markers", name="uniform"))
fig.add_trace(go.Scatter(x=ks, y=acc_distance, mode="lines+markers", name="distance"))
fig.update_layout(
    title="Scratch KNN test accuracy vs k",
    xaxis_title="k",
    yaxis_title="accuracy",
    width=750,
    height=450,
)
fig.show()

best_k = ks[int(np.argmax(acc_distance))]
print("Best k (distance weights) in this sweep:", best_k)
Best k (distance weights) in this sweep: 25

6) KNN regression (1D example)#

Regression KNN is like:

“What’s the average target value of the nearest examples?”

With uniform weights it produces a piecewise-constant, step-like curve: the prediction only changes where the set of nearest neighbors changes. Distance weighting smooths this somewhat.

# 1D regression dataset
n_reg = 120
x_reg = np.linspace(-3, 3, n_reg)
noise_reg = rng.normal(0, 0.25, size=n_reg)
y_reg = np.sin(x_reg) + 0.2 * x_reg + noise_reg

X1 = x_reg.reshape(-1, 1)

# Train/test split
perm = rng.permutation(n_reg)
train_idx, test_idx = perm[:90], perm[90:]
X1_tr, y_tr_reg = X1[train_idx], y_reg[train_idx]
X1_te, y_te_reg = X1[test_idx], y_reg[test_idx]

x_grid = np.linspace(x_reg.min() - 0.2, x_reg.max() + 0.2, 500).reshape(-1, 1)

fig = px.scatter(x=X1_tr[:, 0], y=y_tr_reg, title="1D regression training data", labels={"x": "x", "y": "y"})
fig.update_traces(marker=dict(size=7, opacity=0.8))
fig.update_layout(width=750, height=450)
fig.show()
def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=X1_tr[:, 0], y=y_tr_reg, mode="markers", name="train", marker=dict(size=7, opacity=0.7)))

for k in [1, 5, 15, 35]:
    model = ScratchKNNRegressor(n_neighbors=k, p=2, weights="uniform").fit(X1_tr, y_tr_reg)
    y_grid = model.predict(x_grid)
    y_pred_te = model.predict(X1_te)
    fig.add_trace(go.Scatter(x=x_grid[:, 0], y=y_grid, mode="lines", name=f"k={k} (RMSE={rmse(y_te_reg, y_pred_te):.3f})"))

fig.update_layout(title="Scratch KNN regression: effect of k", xaxis_title="x", yaxis_title="y", width=900, height=500)
fig.show()

6.1 Uniform vs distance weights (regression)#

With distance weights, closer points count more.

  • helps when the “right answer” is very local

  • can be noisier when points are extremely close (hence the small \(\varepsilon\))

fig = go.Figure()
fig.add_trace(go.Scatter(x=X1_tr[:, 0], y=y_tr_reg, mode="markers", name="train", marker=dict(size=7, opacity=0.7)))

k = 15
m_uniform = ScratchKNNRegressor(n_neighbors=k, p=2, weights="uniform").fit(X1_tr, y_tr_reg)
m_distance = ScratchKNNRegressor(n_neighbors=k, p=2, weights="distance").fit(X1_tr, y_tr_reg)

fig.add_trace(go.Scatter(x=x_grid[:, 0], y=m_uniform.predict(x_grid), mode="lines", name="uniform"))
fig.add_trace(go.Scatter(x=x_grid[:, 0], y=m_distance.predict(x_grid), mode="lines", name="distance"))

fig.update_layout(title=f"Scratch KNN regression: weights (k={k})", xaxis_title="x", yaxis_title="y", width=900, height=500)
fig.show()

7) Why scaling matters (a dramatic demo)#

We’ll make a dataset with two features:

  • \(x_1\) in range ~[-2, 2]

  • \(x_2\) in range ~[0, 2000]

If we don’t scale, \(x_2\) dominates the distance — and KNN behaves as if \(x_1\) barely exists.

# Make a scaled-up second feature that should NOT dominate the decision
X_bad = X.copy()
X_bad[:, 1] = (X_bad[:, 1] - X_bad[:, 1].min()) / (np.ptp(X_bad[:, 1]) + 1e-9)  # normalize to [0,1]
X_bad[:, 1] = X_bad[:, 1] * 2000  # huge scale

Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X_bad, y, test_size=0.3, random_state=7, stratify=y)

# Scratch KNN without scaling
m_raw = ScratchKNNClassifier(n_neighbors=15, weights="uniform").fit(Xb_tr, yb_tr)
acc_raw = accuracy_score(yb_te, m_raw.predict(Xb_te))

# sklearn Pipeline with scaling
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=15, weights="uniform")),
])
pipe.fit(Xb_tr, yb_tr)
acc_scaled = accuracy_score(yb_te, pipe.predict(Xb_te))

print(f"Accuracy without scaling: {acc_raw:.3f}")
print(f"Accuracy with scaling   : {acc_scaled:.3f}")
Accuracy without scaling: 0.880
Accuracy with scaling   : 0.973
# Plot boundaries (note: x2 scale is huge)
fig1 = plot_knn_boundary_2d(m_raw, Xb_tr, yb_tr, title="Scratch KNN boundary (NO scaling)")
fig1.show()

# Boundary after scaling: the sklearn Pipeline already exposes predict_proba,
# so it can be passed to the plot helper directly
fig2 = plot_knn_boundary_2d(pipe, Xb_tr, yb_tr, title="sklearn KNN boundary (StandardScaler + KNN)")
fig2.show()

8) Curse of dimensionality (why KNN can struggle in high dimensions)#

In high dimensions, points tend to become “equally far apart”.

A classic demo:

  • sample points uniformly in \([0,1]^d\)

  • look at the ratio: (nearest distance) / (farthest distance)

As dimension \(d\) grows, the ratio approaches 1.

That means:

the nearest neighbor isn’t that much nearer than a random point

So KNN can lose its “locality” advantage unless you reduce dimension or have a huge dataset.

def nn_distance_ratio(dim: int, n_points: int = 1500, seed: int = 7) -> float:
    r = np.random.default_rng(seed + dim)
    Xd = r.random((n_points, dim))
    xq = r.random((1, dim))

    dists = pairwise_minkowski_distances(xq, Xd, p=2)[0]
    return float(dists.min() / dists.max())


dims = list(range(1, 51, 2))
ratios = [nn_distance_ratio(d) for d in dims]

fig = go.Figure()
fig.add_trace(go.Scatter(x=dims, y=ratios, mode="lines+markers"))
fig.update_layout(
    title="Nearest/farthest distance ratio vs dimension",
    xaxis_title="dimension d",
    yaxis_title="min(dist)/max(dist)",
    width=750,
    height=450,
)
fig.show()

9) Practical KNN with scikit-learn#

sklearn provides optimized KNN implementations with a clean interface.

9.1 Key parameters (Classifier / Regressor)#

  • n_neighbors: K

  • weights: 'uniform' or 'distance' (or a custom function)

  • p: Minkowski distance power (p=2 Euclidean, p=1 Manhattan)

  • metric: distance metric (default 'minkowski')

  • algorithm: 'auto', 'ball_tree', 'kd_tree', 'brute'

  • leaf_size: affects tree-based neighbor search performance

  • n_jobs: parallelism (when supported)

A good default starting point:

  • scale features with StandardScaler

  • try a small sweep over n_neighbors

  • compare weights='uniform' vs weights='distance'

# A simple sklearn baseline
sk_knn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=15, weights="distance", p=2)),
])

sk_knn.fit(X_tr, y_tr)
y_pred = sk_knn.predict(X_te)
print("Test accuracy:", accuracy_score(y_te, y_pred))
Test accuracy: 0.9466666666666667

9.2 A small grid search (select K + weights)#

This is the “grown-up” version of our manual accuracy sweep.

param_grid = {
    "knn__n_neighbors": list(range(1, 51, 2)),
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
gs.fit(X_tr, y_tr)

print("Best params:", gs.best_params_)
print("Best CV score:", gs.best_score_)
print("Test accuracy:", accuracy_score(y_te, gs.best_estimator_.predict(X_te)))
Best params: {'knn__n_neighbors': 7, 'knn__p': 1, 'knn__weights': 'uniform'}
Best CV score: 0.9314285714285715
Test accuracy: 0.9533333333333334

10) Summary: when to use KNN#

Great when:

  • you want a strong baseline with minimal assumptions

  • the decision boundary is complex and you have enough data

  • interpretability as “similar examples” is valuable

Be careful when:

  • features have different scales (always scale)

  • you’re in very high dimensions (consider PCA / feature selection)

  • you need fast predictions at huge scale (consider approximate nearest neighbors or different models)

KNN is like a good memory + a sensible notion of similarity.

Exercises#

  1. Implement a custom distance that weighs features differently.

  2. Add tie-breaking rules explicitly for classification.

  3. Compare p=1 vs p=2 on make_moons.

  4. For regression, compare KNN to linear regression on the same dataset.

References#

  • scikit-learn docs: KNeighborsClassifier / KNeighborsRegressor

  • “The Elements of Statistical Learning” — KNN discussion